A chromosome-level haplotype-resolved genome assembly of oriental tobacco budworm (Helicoverpa assulta)

Oriental tobacco budworm (Helicoverpa assulta) and cotton bollworm (Helicoverpa armigera) are two closely related species within the genus Helicoverpa. They have similar appearances and consistent damage patterns, often leading to confusion. However, the cotton bollworm is a typical polyphagous insect, while the oriental tobacco budworm belongs to the oligophagous insects. In this study, we used Nanopore, PacBio, and Illumina platforms to sequence the genome of H. assulta and used Hifiasm to create a haplotype-resolved draft genome. The Hi-C technique helped anchor 33 primary contigs to 32 chromosomes, including two sex chromosomes, Z and W. The final primary haploid genome assembly was approximately 415.19 Mb in length. BUSCO analysis revealed a high degree of completeness, with 99.0% gene coverage in this genome assembly. The repeat sequences constituted 38.39% of the genome assembly, and we annotated 17093 protein-coding genes. The high-quality genome assembly of the oriental tobacco budworm serves as a valuable genetic resource that enhances our comprehension of how they select hosts in a complex odour environment. It will also aid in developing an effective control policy.

evolutionary patterns in insect feeding habits and elucidating the underlying mechanisms governing interactions with host plants.
This study presents a high-quality haplotype-resolved genome assembly of H. assulta at the chromosome level, achieved through the use of PacBio long reads, nanopore ultra-long reads, and high-throughput chromosome conformation capture (Hi-C) data.Utilizing Hifiasm 8 , we created three haplotype-resolved draft genomes: primary, paternal, and maternal, their genome sizes were 441.6 MB, 395.38    respectively.Following the correction of sequence errors and removal of haplotigs, the primary genome now stands at 415. 19 Mb in size, with a contig N50 length of 13.99 Mb.Notably, all 33 primary contigs were successfully anchored onto 32 chromosomes, encompassing both Z and W sex chromosomes.Furthermore, the genome assembly exhibited a high degree of completeness, as evidenced by the BUSCO analysis, which revealed 99.0% gene coverage.Repeat sequences constituted 38.39% of the genome assembly.A total of 17,093 protein-coding genes were identified, with 16,889 being functionally annotated.Transcriptome analysis indicated that 14,681 genes were expressed in at least one sample.

Sample collection.
The larvae of H. assulta were collected from tobacco fields in the Xu Chang campus of Henan Agricultural University (113.80°E, 34.13° N) and reared continuously for more than ten generations in the laboratory.The insects were reared on an artificial diet under controlled conditions at 26 ± 1 °C, with a 14:10 (L:D) photoperiod cycle and 85% ± 5% relative humidity.Pupae and newly molted adults were selected for sequencing, and the adult insects that were used for sequencing had their wings removed before the process.Genome sequencing and size estimation.The genomic DNA for PacBio HiFi sequencing was extracted from a newly molted female adult using the QIAamp DNA Mini Kit (QIAGEN).The DNA's integrity was assessed using the Agilent 4200 Bioanalyzer (Agilent Technologies, Palo Alto, California).Subsequently, 15 μg of genomic DNA was sheared using g-Tubes (Covaris) and concentrated with AMPure PB magnetic beads.Each SMRT bell library was prepared using the Pacific Biosciences SMRTbell express template prep kit 2.0.The constructed libraries underwent size selection on a BluePippin ™ system for molecules ≥ 15Kb, followed by primer annealing and the binding of SMRT bell templates to polymerases using the DNA/Polymerase Binding Kit.Sequencing was performed on the Pacific Bioscience Sequel II platform for 30 hours at the Annoroad Gene Technology company.Finally, a total of 1,933,848 high-quality HiFi reads were generated with a combined length of 38,425,265,244 bp; the detailed information about HiFi reads is listed in Table 1.
To perform Illumina second-generation DNA sequencing, one newly molted adult female and its parents were collected and rinsed with pre-cooled 0.9% saline to contamination, and frozen with liquid nitrogen.Genomic DNA was extracted from the collected samples using the sodium dodecyl sulfate (SDS) extraction method.After testing the DNA quality and integrity, it was randomly sheared by a Covaris ultrasonic disruptor.Illumina sequencing pair-end libraries were prepared using the Nextera DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA).Sequencing was performed using the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA).Raw reads were filtered using Fastp 9 software (version 0.21.0) with the following criteria: removal of reads with adapter contamination, removal of reads with an N proportion greater than 5%, and discarding reads with a low-quality base count of 50% or more, where the quality value is less than or equal to 19 (Table 2).
The Hi-C libraries were constructed using standard protocols as previously described 10 , with one newly molted female used as the input.The Hi-C sequencing library was then amplified by PCR (12-14 cycles) and sequenced on the Illumina HiSeq instrument, generating 154,726,330 paired clean reads with 2 × 150-bp reads.
We collected female pupae specifically for the construction of Oxford Nanopore libraries.The libraries were prepared using the standard protocol for Oxford Nanopore sequencing, specifically the Ultra-Long  DNA Sequencing Kit protocol (SQK-ULK001).The purified library was loaded onto primed R9.4 Spot-On Flow Cells and sequenced using a PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK) with 72-hour runs at Novogene Corporation Inc., Tianjin, China.Basecalling of raw fast5 format data was performed using Oxford Nanopore GUPPY 11 software, removing low-quality reads with a sequencing quality value (Q) less than seven and retaining high-quality pass reads.The quality assessment report was generated using NanoPlot 12 v1.38.1.Finally, 254,496 Oxford Nanopore raw reads were generated with a combined length of 25,843,417,397 bp, and the detailed information is listed in Table 3.Genomic characteristics, such as genome size, repeat content, and heterozygous rate, were estimated based on K-mer frequencies.Utilizing K-mer analysis (K = 21) of Illumina short reads and PacBio HiFi long reads with Jellyfish 13 v2.3.0,we estimated the overall genome size of H. assulta to be approximately 350 Mb using genescope2.0 14 .For the Illumina short-read reads, the genome size was estimated to be 354.98Mb, with a heterozygosity rate of 2.08%; For the PacBio HiFi reads, the genome size was estimated to be 348.38 Mb, with a heterozygosity rate of 2.04% (Fig. 1).

Genome assembly.
The chromosome-level haplotype-resolved genome assembly with trio binning was achieved using Hifiasm 8 v0.19.5 software; this involved incorporating Illumina short paired-end reads from the parents, Illumina Hi-C paired-end reads, ultra-long ONT reads, and Pacbio HiFi reads.The primary contigs and two other haplotypes (paternal and maternal) contigs assembled by Hifiasm were further refined using Nextpolish2 15 v0.1.0software.This refinement process involved the use of PacBio long HIFI reads and Illumina short reads, resulting in the production of three draft genome assemblies.
Certain regions in a genome with high genetic diversity result in separate primary contigs for each haplotype instead of a single contig with an associated haplotig 16 .Whether you are working on the haploid or Fig. 3 The alignment rate of the RNA-seq data.The RNA-seq data with a dark red colour (a) comes from this genome sequencing project; the data with a dark grey colour (b) was downloaded from the NCBI SRA database.The value at the red dash-line is equal to 85. phased-diploid assembly, this can be an issue for downstream analysis.Hifiasm 8 is a powerful assembler that can generate high-quality chromosome-level assemblies.Compared to other assemblers, it produces longer contigs and can resolve more segmental duplications.By using Hifiasm, we created three haplotype-resolved draft genomes: primary, paternal, and maternal, their genome sizes were 441.6 MB, 395.38  The expression matrix 59 has been deposited into figshare.com.
respectively (Table 4).Although Hifiasm can eliminate most duplications between haplotigs, it may incorrectly identify or fail to distinguish some heterozygous sequences.To address this issue, we used the Purge Haplotigs 17 v1.1.2software with long HiFi reads to remove haplotigs remaining in the three draft assembled genomes.Assembly completeness was estimated by BUSCO 18 v5.4.7 analysis and Illumina short reads mapping; the lineage dataset used in BUSCO is insecta_odb10, and bowtie2 19 v2.5.1 software was used to align the purged genome assembly.The analysis identified 99.0% (single-copied genes: 98.7%, duplicated genes: 0.3%), 0.5%, and 0.5% of the 1,367 predicted genes in this genome as complete, fragmented, and missing sequences, respectively.These results suggested that the assembled genome is highly complete.
Genome scaffolding.These high-quality Hi-C sequencing clean reads were mapped to the trimmed draft genome using BWA 0.7.17 20 and filtered for unmapped and multiple mapped reads using Samtools v1.16 21 .The unique, high-quality paired-end reads mapped close to the restriction sites were retained for downstream analysis in the juicer 22 v1.6 and 3d-dna 23 v180922 pipeline.Juicebox 23 was used to cluster the contigs into groups, and the order of the contigs was confirmed based on the strength of interactions between read pairs.During the process of grouping contigs based on Hi-C data, we observed that 33 contigs were grouped into 32 clusters (Fig. 2), with only one cluster (Chr16) containing two contigs.To ensure the accuracy of the connection between these two contigs, we used paired-end information from the sequencing data, if there are telomere repeat sequences  present, confirm that they are located at the ends of the sequence.After correcting sequence errors and removing haplotigs, the final genome stands at 415.19 Mb, with an average length of 12.58 Mb after scaffolding (Table 5).
rNa sequencing and analysis.We collected fourth and fifth instar larvae, female and male pupae, and newly emerged male and female adult moths for transcriptome sequencing and gene expression analysis.Before preparation and sequencing, we removed the midguts of the larvae and the wings of the adults.Subsequently, total RNA was extracted from the aforementioned samples using Trizol reagent (Invitrogen, USA) following the manufacturer's protocol.Illumina RNA sequencing libraries were prepared by Annoroad Gene Technology Company.We performed RNA sequencing on 18 samples and obtained RNA-seq data with a total length of about 1209 gigabytes (Table 6).The total number of sequences is around 807 million, with an average proportion of bases having a quality greater than Q30 at 93.9% and an average proportion of clean reads at 97.25%.Clean data was obtained by removing adapters, low-quality reads, and high-content unknown sequences.All RNAseq data sequenced in this project have been deposited into the European Nucleotide Archive (ENA) with accession number PRJEB7091153.In addition to our sequencing data, we downloaded 39 transcriptome datasets from the NCBI SRA database which were merged with our dataset.Each sample's data was aligned to the genome using HISAT2 24 to assess gene transcription levels.Analysis shows that more than half of the transcriptome samples exhibit a genome alignment rate of over 85%, and the genome alignment rate of the samples in this project (group a) is consistently around 90% (Fig. 3).
addition, we identified 86 rRNAs and 62 tRNAs.The Circos plot of the functional element we identified is shown in Fig. 4. All annotation files have been deposited into figshare.com 38.

Sex chromosomes analysis.
To identify the sex chromosomes (Z and W chromosomes) in H. assulta, we resequenced one female pupa and one male pupa using Illumina HiSeq platforms to obtain an approximate 50 × coverage.In males, the normalized coverage levels of sequence reads from the Z chromosome should be twice that of females.On the other hand, ideally, males do not have any DNA contribution from the W chromosome, while the autosomes should have equal coverage between males and females.Therefore, a difference in sequencing coverage ratio is expected for both Z and W chromosomes between sexes but not for autosomes.This difference can be used to identify sex-linked chromosomes.Using salmon 39 , we computed the normalized coverage levels of chromosomes by mapping the resequencing reads to the final H. assulta genome with default parameters.To analyze and visualize the log2 of the male: female (M: F) coverage ratio, we used the R package changepoint v2.2.4 (https://github.com/rkillick/changepoint/).Remarkably, among all the chromosomes, it was observed that the sequencing depth of the longest chromosome (Chr1) is twice as high in males compared to females, leading to the conclusion that Chr1 is the Z chromosome (Fig. 5).Ideally, the length of the W chromosome should be similar to that of the Z chromosome and exhibit shallow sequencing depth in males.Only the second-longest chromosome (Chr2) meets both criteria, thus leading to the conclusion that Chr2 is the W chromosome. Synteny analysis.To compare the genomic arrangement of H. assulta with its closely related species, cotton bollworms (H.armigera), we used annotated protein sequences anchored on chromosomes to perform synteny analysis through MCScanX 40 with default parameters.From the NCBI genome database, we obtained the reference genome HaSCD2 data (accession number: GCF_023701775.1) of cotton bollworms.The analysis showed that most of the chromosomes of the two moths exhibited good collinearity, with only a few chromosome fragments undergoing fission and fusion events.For example, although most of Chr6 of H. assulta was syntenic to Chr4 of H. armigera, a small part was syntenic to Chr29.We visualized the results using Tbtools 41 .Due to the absence of the W chromosome in the cotton bollworm reference genome, we did not observe any collinearity between the W chromosome of the H. assulta and any chromosome in the cotton bollworm genome (Fig. 6).

Phylogenetic reconstruction.
To establish the evolutionary relationship between the tobacco budworm and other closely related species, we retrieved protein sequences of six species belonging to the Noctuidae family and one Coleopteran insect (T.castaneum) from the NCBI genome database and only the longest transcript for each gene was taken into consideration.OrthoFinder 42 v2.5.4,with DIAMOND 43 v2.1.8,was used to identify orthologs and homologs.OrthoFinder successfully assigned 125918 genes (96.9%) to 14619 orthogroups.At least 50% of all genes belonged to orthogroups with eight or more genes (G50 was 8) and were contained in the largest 5245 orthogroups (O50 was 5245).There were 6498 orthogroups with all species present, and 2822 of these consisted entirely of single-copy genes.
For the phylogenetic analysis, we constructed a maximum likelihood phylogenetic species tree using the STAG method in the OrthoFinder 42 program, rooted in STRIDE 44 .Multiple sequence alignments of single-copy gene families were performed using MAFFT 45 v7.520 with the "-auto" parameter, and the alignment results were trimmed using trimAL 46 v1.4.rev15 with the "-automated1" setting.The alignments of all single-copy orthologs were concatenated to form a supergene.
We then utilized the mcmctree from the PAML 47 package to estimate the divergence time of the species in the tree.Divergence information obtained from the TimeTree 48 database (S. frugiperda vs S. litura 16.9-19.1MYA, N. ni 70-80, and T. castaneum 195-361.6MYA) was combined with mcmctree to constrain the divergence estimate.Subsequently, we visualized the time tree using the Figtree software (https://github.com/rambaut/figtree).The divergence time distance between H. assulta and H. armigera was estimated to be around 6.2 million years.
To analyze the expansion and contraction of gene families, we utilized the matrix tables of gene family orthologs obtained from OrthoFinder results.We applied these tables as inputs in CAFE 49 v5.0.0 and set a cut-off p-value of <0.05, allowing us to examine each gene family's expansion and contraction (Fig. 7).

Data records
The Nanopore, Hi-C, and Illumina sequencing data used for the genome assembly and annotation have been submitted to the European Nucleotide Archive (ENA) with accession number PRJEB70911 50 .The final chromosome assembly has been submitted to the National Genomic Data Center (NGDC) under the accession GCA_963856015.1 51 .The H. armigera genome was downloaded from the NCBI genome database 52 .All public RNA-seq datasets used in the gene expression analysis were downloaded from the NCBI SRA database, and the corresponding project IDs were PRJEB6594 53 , PRJNA587871 54 , PRJNA590047 55 , PRJNA592822 56 , and PRJNA261645 57 .

technical Validation
The chromosome-level primary genome assembly was 415.19 Mb.For quantitative assessment of genome assembly, BUSCO 18 analysis results showed that 99.0% of BUSCO genes (insecta_odb10) were successfully identified in the genome assembly, suggesting a remarkably complete assembly of the H. assulta genome.In addition, the genome alignment rate of HiFi reads is as high as 99.98%.The Hi-C heatmap revealed a well-organized interaction contact pattern along the diagonals within/around the chromosome inversion region, which indirectly confirmed the accuracy of the chromosome assembly.
To verify the completeness of our genome chromosome assembly, we conducted an analysis of telomere repeat sequences on each chromosome based on the genome repeat sequence annotation results.Initially, we analyzed the telomere repeat motif sequences of Lepidoptera insects in TeloBase 58 , and we found that the majority of repeat motifs ranged from 5 to 9 bp in length, and (TTAGG)n/(CCTAA)n is the main motif in telomeres.After that, we identified regions within 15 kb at both ends of the chromosomes in our results where the length of repeat sequences exceeded 1k,and the repeat motif sequence ranged from 5 to 9 bp.Based on our analysis, we found that 21 chromosomes contain the typical telomeric motif (TTAGG)n/(CCTAA)n or a variant of the motif within 15 kb at both ends, while the remaining 11 chromosomes have the typical telomeric motif or a variant of the motif on at least one end (Table 9).
In our investigation of sex chromosome determination, we utilized minimap2 to align genome contigs from two haplotypes (paternal and maternal) generated by the Hifiasm program with the primary final genome.The alignment revealed that contigs from the paternal haplotype could be aligned with all chromosomes except Chr2, while those from the maternal haplotype could be aligned with all chromosomes except Chr1 (Fig. 8).It is well-established that the sex determination in tobacco hornworms relies on two sex chromosomes, Z and W, where females possess a Z-W genotype while males have Z-Z.For this study, we employed single-headed female insects as the experimental material for genome sequencing.The analysis above reaffirmed our conclusion that Chr1 is the Z sex chromosome and Chr2 is the W sex chromosome.

Fig. 1 K
Fig. 1 K-mer spectra and fitted models for H. assulta based on Illumina short-read reads and PacBio HiFi reads using a 21-mer count histogram.(a) K-mer spectra and fitted models for H. assulta based on Illumina shortread reads.(b) K-mer spectra and fitted models for H. assulta based on PacBio HiFi reads.

Fig. 2
Fig. 2 Heat map of Hi-C assembly of H. assulta.The scale bar represents the interaction frequency of Hi-C links.

Fig. 4
Fig.4 Characterization of the H. assulta genome.Circos plot of chromosome level genome assembly (~415.19Mb) and the distribution of GC content, gene frequency, and ncRNA frequency on 32 chromosomes.

Fig. 5
Fig.5 The coverage ratios of male/female for each chromosome.Each point represents a single chromosome.The dotted red line shows the expectation for the Z chromosome.

Fig. 6 Fig. 7
Fig. 6 Chromosomal synteny plot of H. assulta and H. armigera genomes.The dark yellow strip at the top represents the chromosomes of the H. assulta, while the light green strip at the bottom represents the chromosomes of the H. armigera.

Fig. 8
Fig. 8 The alignment dot plot of haplotype assemblies Hap1 and Hap2 with the primary reference genome.(a) The alignment dot plot of haplotype assemblies Hap1 (paternal haplotype) with the primary reference genome.(b) The alignment dot plot of haplotype assemblies Hap2 (maternal haplotype) with the primary reference genome.The two haplotype draft genome sequences 60 have been deposited into Figshare.com.

Table 1 .
MB, and 404.67 MB, Summary statistics of the Illumina HiFi reads.

Table 2 .
Summary statistics of the Illumina genomic DNA short reads.

Table 3 .
Summary statistics of the Oxford Nanopore raw reads.

Table 4 .
Summary statistics of three draft Hifiasm assemblies for H. assulta.

Table 5 .
Summary statistics of the final H. assulta genome assemble.

Table 6 .
Summary statistics of the Illumine RNA-seq short reads.

Table 7 .
MB, and 404.67 MB, Repeats elements statistics of the H. assulta genome.

Table 8 .
The statistical data on chromosome length, total gene count, and number of expressed genes.Note: